Day 3: PCA analysis and clustering demostration

1. Creating a synthetic Fungal Trait Dataset

Let’s start by generating a simple synthetic dataset of 4 fungal traits to show the analysis process. We’ll imagine 10 hypothetical fungal species (labeled A through J for simplicity). For each species, we have measured:

  • HyphalExtensionRate (mm/day) – e.g. how fast the fungus’s mycelium grows.
  • EnzymeActivity (arbitrary units) – representing extracellular enzyme production.
  • SporeProduction (count per culture or per unit time) – number of spores produced (indicates reproductive output and dispersal potential).
  • DroughtTolerance (percentage survival or growth under dry conditions) – a measure of stress tolerance (higher means the fungus can withstand drought stress better).

Create synthetic fungal trait data

(In practice, this might be loaded from a file or measured experimentally, but here we explicitly define it for demonstration.)

# Define species names
species <- c("Species_A","Species_B","Species_C","Species_D",
             "Species_E","Species_F","Species_G","Species_H",
             "Species_I","Species_J")

# Manually assign trait values for each species:
hyphal_ext <- c(10, 9, 12, 11,    # A-D: high extension rates
                2, 4, 3, 5,       # E-H: low extension rates
                6, 7)             # I-J: moderate extension

enzyme_activity <- c(2, 4, 3, 2,  # A-D: low enzyme activity
                     9, 8, 7, 9,  # E-H: high enzyme activity
                     4, 5)        # I-J: moderate enzyme

spore_prod <- c(900, 1100, 950, 1200,   # A-D: high spore production
                80, 150, 120, 90,      # E-H: low spore production
                1000, 800)            # I-J: high spore (similar to A-D)

drought_tol <- c(25, 15, 30, 20,       # A-D: low drought tolerance (%)
                 85, 75, 80, 90,       # E-H: high drought tolerance
                 80, 70)              # I-J: high drought tolerance (like E-H)

# Combine into a data frame
trait_data <- data.frame(Species = species,
                         HyphalExtensionRate = hyphal_ext,
                         EnzymeActivity = enzyme_activity,
                         SporeProduction = spore_prod,
                         DroughtTolerance = drought_tol,
                         stringsAsFactors = FALSE)

# Take a quick look at the dataset
trait_data
##      Species HyphalExtensionRate EnzymeActivity SporeProduction
## 1  Species_A                  10              2             900
## 2  Species_B                   9              4            1100
## 3  Species_C                  12              3             950
## 4  Species_D                  11              2            1200
## 5  Species_E                   2              9              80
## 6  Species_F                   4              8             150
## 7  Species_G                   3              7             120
## 8  Species_H                   5              9              90
## 9  Species_I                   6              4            1000
## 10 Species_J                   7              5             800
##    DroughtTolerance
## 1                25
## 2                15
## 3                30
## 4                20
## 5                85
## 6                75
## 7                80
## 8                90
## 9                80
## 10               70

Exploratory Data Analysis

It’s good practice to do some quick exploration of a new dataset:

# Explore the data
str(trait_data)           # structure of the data frame
## 'data.frame':    10 obs. of  5 variables:
##  $ Species            : chr  "Species_A" "Species_B" "Species_C" "Species_D" ...
##  $ HyphalExtensionRate: num  10 9 12 11 2 4 3 5 6 7
##  $ EnzymeActivity     : num  2 4 3 2 9 8 7 9 4 5
##  $ SporeProduction    : num  900 1100 950 1200 80 150 120 90 1000 800
##  $ DroughtTolerance   : num  25 15 30 20 85 75 80 90 80 70
summary(trait_data[, -1]) # summary of trait columns (excluding Species name)
##  HyphalExtensionRate EnzymeActivity SporeProduction  DroughtTolerance
##  Min.   : 2.00       Min.   :2.00   Min.   :  80.0   Min.   :15.00   
##  1st Qu.: 4.25       1st Qu.:3.25   1st Qu.: 127.5   1st Qu.:26.25   
##  Median : 6.50       Median :4.50   Median : 850.0   Median :72.50   
##  Mean   : 6.90       Mean   :5.30   Mean   : 639.0   Mean   :57.00   
##  3rd Qu.: 9.75       3rd Qu.:7.75   3rd Qu.: 987.5   3rd Qu.:80.00   
##  Max.   :12.00       Max.   :9.00   Max.   :1200.0   Max.   :90.00

We exclude the first column (Species) in summary because it’s not numeric. The summary will show basic stats (min, median, mean, max) for each trait.

You can already see differences in scale: e.g. SporeProduction numbers are much larger (hundreds) compared to the others. HyphalExtensionRate ranges roughly 2–12, EnzymeActivity 2–9, DroughtTolerance 15–90 (%). Because of these scale differences, we will need to normalize the traits before PCA and distance calculations, or else SporeProduction (with values ~1000 vs ~100) would dominate the variance.

2. Data preparation: Scaling the Traits

For PCA (and clustering based on Euclidean distance), it’s crucial to scale the traits so that they are comparable. Typically we scale each trait to mean 0 and standard deviation 1 (also known as z-scores). This way, a difference of, say, 50 in spore count (which might be small relative to its range) is on the same footing as a difference of 5°C in a growth temperature trait (if we had one, for example). In R, we can scale easily:

# Scale the trait variables (exclude Species column)
trait_data_numeric <- trait_data[, -1]             # drop Species column, keep numeric traits
trait_data_scaled <- scale(trait_data_numeric)     # scale() centers and scales columns by default

# Check that scaling worked:
colMeans(trait_data_scaled)        # should be ~0 for each trait
## HyphalExtensionRate      EnzymeActivity     SporeProduction    DroughtTolerance 
##       -6.800116e-17       -2.775558e-18        1.110223e-17        2.775558e-17
apply(trait_data_scaled, 2, sd)    # should be ~1 for each trait
## HyphalExtensionRate      EnzymeActivity     SporeProduction    DroughtTolerance 
##                   1                   1                   1                   1

We create trait_data_numeric as just the numeric columns for convenience. After scaling, it’s wise to verify the result – the means should be essentially 0 and SDs 1 (due to rounding, you might see extremely small scientific notation values like 1e-16 instead of 0). Now our data is ready for PCA and clustering.

3. Principal Component Analysis in R

We will use the base R function prcomp() for PCA. Since we already scaled the data, we can tell prcomp not to scale again by setting scale. = FALS:

# Perform PCA on the scaled trait data
pca_result <- prcomp(trait_data_scaled, center = FALSE, scale. = FALSE)
# If we hadn't scaled manually, we would do: prcomp(trait_data_numeric, center=TRUE, scale.=TRUE)

# Print PCA summary
summary(pca_result)
## Importance of components:
##                           PC1     PC2     PC3     PC4
## Standard deviation     1.8957 0.49515 0.31924 0.24354
## Proportion of Variance 0.8984 0.06129 0.02548 0.01483
## Cumulative Proportion  0.8984 0.95969 0.98517 1.00000

The summary(pca_result) will show the Proportion of Variance explained by each principal component (PC). Because we have 4 original traits, we will get 4 PCs. We expect the first one or two to explain most of the variance if our traits are highly correlated (which they are by construction).

We can also examine the PCA loadings and scores:

  • Loadings (also called rotation in prcomp output) tell us how each trait contributes to each principal component. These are found in pca_result$rotation. Essentially, it’s a matrix where columns are PCs and rows are our original traits, and values are the coefficients (usually between -1 and 1) for the linear combination. For example, if PC1 has a high positive loading for EnzymeActivity and DroughtTolerance, and a negative loading for HyphalExtensionRate, that means PC1 almost can be seen as a “high enzyme & tolerance vs. low growth” axis.

  • Scores (coordinates of each species on the PCs) are in pca_result$x. Each row corresponds to a species, and columns are PC1, PC2, etc. These tell us where each species lies in the new reduced-dimensional space.

# PCA loadings (how traits map to PCs):
pca_result$rotation
##                            PC1        PC2         PC3         PC4
## HyphalExtensionRate -0.5052261  0.2290401 -0.82217833  0.12771068
## EnzymeActivity       0.5091733  0.3596199 -0.09208987  0.77648925
## SporeProduction     -0.4990295 -0.5583885  0.24664158  0.61509327
## DroughtTolerance     0.4862701 -0.7116294 -0.50468821 -0.04913958
# PCA scores (species coordinates on PCs):
pca_scores <- pca_result$x
head(pca_scores, 10)   # print scores for all 10 species (or use head if more cases)
##               PC1         PC2         PC3          PC4
##  [1,] -1.85152529  0.21066098  0.04681947 -0.422888781
##  [2,] -1.70939493  0.40182028  0.48772972  0.384027949
##  [3,] -1.93021473  0.29632309 -0.51606880 -0.009540253
##  [4,] -2.39671543  0.03561297  0.05162958  0.016254420
##  [5,]  2.44079157  0.17239291  0.27460225  0.084470346
##  [6,]  1.73052404  0.32400947  0.03834687 -0.016198171
##  [7,]  1.80267117  0.04613242  0.20932698 -0.382714485
##  [8,]  2.07441469  0.24089645 -0.51225096  0.199672049
##  [9,] -0.12683590 -1.19864032  0.06451442  0.037322297
## [10,] -0.03371518 -0.52920826 -0.14464953  0.109594628

Interpreting the loadings is key to understanding the PCs. Look at the signs and magnitudes for PC1 and PC2 especially.

Visualise PCA plot

Next, let’s visualize the PCA results. A common approach is a biplot or plotting PC1 vs PC2. We can use the base R biplot function, which plots species scores and trait loadings together:

# Basic PCA biplot (PC1 vs PC2):
biplot(pca_result, scale = 0)

By default this will plot PC1 on the x-axis and PC2 on the y-axis. The points (usually labeled by rownames or numbers) represent the species in PCA space, and the arrows represent the traits (loadings). Because we provided no explicit labels, species might be labeled as numbers 1–10 corresponding to the row order. To improve this, we can add our species names as labels:

# Improved biplot with custom labels:
biplot(pca_result, scale = 0, xlabs = trait_data$Species)

(If you prefer ggplot2, one can also make PCA plots by extracting pca_scores and plotting PC1 vs PC2 with species labels. Another quick method is autoplot(pca_result) from the ggfortify package)

ggfortify lets ggplot2 know how to interpret PCA objects. After loading ggfortify, you can use ggplot2::autoplot function for stats::prcomp and stats::princomp objects.

library(ggplot2)
library(plotly)
library(ggfortify)

p <- autoplot(pca_result) #basic plot
p <- autoplot(pca_result, data = trait_data, colour = "Species") #color points based on metadata

p <- autoplot(pca_result, data = trait_data, colour = "Species",
              loadings = TRUE, loadings.colour = 'blue', 
              loadings.label = TRUE, loadings.label.size = 3) #display eigenvectors and attach eigenvector labels

ggplotly(p) #basic plot

4. Hierarchical Clustering

Now that we’ve explored the continuous trait space, let’s perform clustering to explicitly group the species.

We need to decide on a distance metric and a linkage method. For simplicity, we’ll use Euclidean distance on the scaled trait data (this is the straight-line distance in 4-dimensional trait space between species). Since our PCA was also based on Euclidean variance, this is consistent. For linkage, we use the defult method complete (uses the furthest distance between points in two clusters) here.

# Compute Euclidean distance on scaled data
dist_matrix <- dist(trait_data_scaled, method = "euclidean")

# Perform hierarchical clustering
hc <- hclust(dist_matrix)

# Plot the dendrogram
plot(hc, labels = trait_data$Species, main="Hierarchical Clustering of Fungi")

To make concrete clusters, we can cut the dendrogram at a certain number of clusters. Let’s say we suspect 3 clusters (fast group, slow group, intermediate group):

# Cut tree into 3 clusters
cluster_membership <- cutree(hc, k = 3)
cluster_membership
##  [1] 1 1 1 1 2 2 2 2 3 3

This will output a numeric cluster label (1,2,3) for each species in order. You can also table it against species:

# Cut tree into 3 clusters
# See which species are in which cluster
data.frame(Species = trait_data$Species, Cluster = cluster_membership)
##      Species Cluster
## 1  Species_A       1
## 2  Species_B       1
## 3  Species_C       1
## 4  Species_D       1
## 5  Species_E       2
## 6  Species_F       2
## 7  Species_G       2
## 8  Species_H       2
## 9  Species_I       3
## 10 Species_J       3